This notebook classifies Soil Carbon Challenge data by observation type and cleans up inconsistent terms, preparing the data for insertion into the atlasbiowork database. The munging and cleanup process benefits from using both Excel and pandas for inspection and cleanup.
In [2]:
import pandas as pd
#pd.set_option('mode.sim_interactive', True)
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import json, csv
import re
In [10]:
# Load the exported Challenge spreadsheet; latin1 avoids decode errors on
# legacy characters (presumably an Excel sheet saved as CSV — note the
# absolute local path will only work on this machine).
df = pd.read_csv('/Users/Peter/Documents/atlas/atlasdata/data.xls.csv', encoding='latin1')
df = df.replace(np.nan,' ', regex=True)#MUST REPLACE NAN AS NAN IS NOT ITERABLE
#df.dtypes
In [13]:
# NOTE(review): this overwrites the df loaded in the previous cell with the
# change.csv table — only one of the two load cells should be run per session.
df = pd.read_csv('/Users/Peter/Documents/scc/challenge/obs_types/change.csv')
In [94]:
#GET STRINGS OF LABELS "list_of_terms"
# Collect every distinct term appearing in the label columns into one array.
# pd.unique over the stacked column values drops terms duplicated across
# columns, which the earlier per-column .unique() + np.concatenate did not.
# Extend label_cols with 'label4'/'label5' to widen the search space (later
# replace cells already target those columns).
label_cols = ['label1', 'label2', 'label3']
#label_cols += ['label4', 'label5']
list_of_terms = pd.unique(df[label_cols].values.ravel('K'))
#sorted(list_of_terms)
In [4]:
#SEARCH AND REPLACE
#searchterm = 'lichen'
#replaceterm = 'squish'
#the_list = [item for item in list_of_terms if searchterm in str(item)]
#need to target specific columns
# NOTE(review): mycols is a new frame built from df; replacing values on it
# (even with inplace=True) would not write back into df. Assign the result
# back to the df columns instead.
mycols=df[['label1', 'label2','label3','label4','label5']]
#mycols.replace('=','wox', regex=True)
#mycols.replace(to_replace=the_list, value=replaceterm, inplace=True)
In [12]:
#df = df.replace(np.nan,' ', regex=True)
# Inspect which label4 values mention 'litter' (read-only chained selection).
# NOTE(review): .str.contains propagates NaN, so run the NaN replacement
# above first or the boolean mask will fail — confirm against this data.
df.label4[df.label4.str.contains('litter')]
Out[12]:
In [16]:
# Gather every existing term containing 'lichen' so they can later be
# consolidated into one canonical term.
searchterm = 'lichen'
replaceterm = 'moss/algae/lichen'
the_list = [item for item in list_of_terms if searchterm in str(item)]
#need to target specific columns
mycols=df[['label1', 'label2','label3','label4','label5']]
# Summary stats of the label columns, for eyeballing before the replace.
mycols.describe()
#mycols.replace(to_replace=the_list, value=replaceterm, inplace=True)
Out[16]:
In [6]:
# Consolidate every label term containing 'lichen' into one canonical term.
searchterm = 'lichen'
replaceterm = 'moss/algae/lichen'
the_list = [item for item in list_of_terms if searchterm in str(item)]
# Fixes two defects in the original cell: `thelist` was a NameError (the
# variable is `the_list`), and .replace(..., inplace=True) on the df[[...]]
# selection mutated a temporary copy, so df itself was never updated.
# Assigning the replaced frame back to the columns makes the edit stick.
label_cols = ['label1', 'label2', 'label3', 'label4', 'label5']
df[label_cols] = df[label_cols].replace(to_replace=the_list, value=replaceterm)
Out[6]:
In [29]:
# Persist the transects table for PSQL import.
# NOTE(review): `transects` is not defined in this notebook as shown — it must
# come from a cell not captured here; confirm before a fresh-kernel run.
transects.to_csv('/Users/Peter/Documents/scc/challenge/PSQL/transectsOct23.csv', index=False)
In [33]:
# Partition photo observations by the `type` column: rows whose type mentions
# 'line', 'ang' (or whose note mentions 'step back'), or 'vert'.
linephotos = df[(df.type.str.contains('line'))]
angphotos = df[(df.type.str.contains('ang')) | (df.note.str.contains('step back'))]
vertphotos = df[df.type.str.contains('vert')]
# Cell output: how many vertical photos were found.
len(vertphotos)
#re.findall('\d+', s) #finds digits in s
def get_num(x):
    """Return all digit characters of string ``x`` fused into a single int.

    Digits from separate runs are concatenated, e.g. 'Hoop 1, 125' -> 1125.
    Returns None when ``x`` contains no digits.
    """
    digits = ''.join(ch for ch in x if ch.isdigit())
    # Original had an unreachable `pass` after the conditional return;
    # the implicit None return is now explicit.
    return int(digits) if digits else None
#get_num('Hoop 1, 125\'')
#df.ix[459]
for y in range(len(df)):
Out[33]:
In [63]:
#basic row selection from http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/
#which has GREAT info on joins
# Rows recorded by observer Peter Donovan with positive latitude.
peter_plots = df[(df.lat > 0) & (df.observer.str.contains('Peter Donovan'))]
# Rows whose type marks them as plot summaries, changes, or remonitoring.
features = df[(df.type == '*plot summary')|(df.type == 'change')|(df.type.str.contains('remonitor'))];
#df.iloc[100] and df.ix[100] get the row referred to by the default 0-based index
# df.loc['Kellogg LTER'] doesn't work because it's not an index;
# dfnew = df.set_index('id'); this works even tho id is not unique
#dfnew.loc['Kellogg LTER'] and this works; use inplace=True as arg to modify existing df
# dfnew.loc['BURR1'] returns all rows for this index
#column selector
#df[['type','label3']]; need to use double [] to enclose a list
#new column
#df['new'] = df.lat + 2
In [60]:
#df['featureID'] = df.id.str[0:6] #str function slices string
#df.type #although type is a keyword this works